Sample Prevalence vs Global Prevalence
Cross-posted from my NAO Notebook. Thanks to Evan Fields and Mike McLaren for editorial feedback on this post.
In Detecting Genetically Engineered Viruses With Metagenomic Sequencing we have:
our best guess is that if this system were deployed at the scale of approximately $1.5M/y it could detect something genetically engineered that shed like SARS-CoV-2 before 0.2% of people in the monitored sewersheds had been infected.
I want to focus on the last bit: “in the monitored sewersheds”. The idea is, if a system like this is tracking wastewater from New York City, its ability to raise an alert for a new pandemic will depend on how far along that pandemic is in that particular city. This is closely related to another question: what fraction of the global population would have to be infected before it could raise an alert?
There are two main considerations pushing in opposite directions, both based on the observation that the pandemic will be farther along in some places than others:
With so many places in the world where a pandemic might start, the chance that it starts in NYC is quite low. To take the example of COVID-19, when the first handful of people were sick they were all in one city in China. Initially, prevalence in monitored sewersheds in other parts of the world will be zero, while global prevalence will be greater than zero. This effect should diminish as the pandemic progresses, but at least in the <1% cumulative incidence situations I’m most interested in it should remain a significant factor. This pushes prevalence in your sample population to lag prevalence in the global population.
NYC is a highly connected city: lots of people travel between there and other parts of the world. Since pandemics spread as people move around, places with many long-distance travelers will generally be infected before places with few. While if you were monitoring an isolated sewershed you’d expect this factor to cause an additional lag in your sample prevalence, if you specifically choose places like NYC we expect instead the high connectivity to reduce lag relative to global prevalence, and potentially even to lead global prevalence.
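To make these two effects concrete, here is a minimal toy sketch, not a real epidemiological model. It assumes three hypothetical sites (an origin city, a well-connected hub, and an isolated site), deterministic logistic growth, and made-up connectivity values. Every name and number in it is an illustrative assumption. It reports cumulative incidence at a monitored site on the day global cumulative incidence first crosses a threshold.

```python
def simulate(populations, connectivity, origin, growth=0.2, days=120, seed=10.0):
    """Toy deterministic metapopulation model (illustrative assumptions only).

    connectivity[i][j] is a hypothetical per-day rate at which cumulative
    infections in site j seed new infections in site i. Cumulative infections
    stand in for active infections, which is a deliberate simplification.
    Returns a list with one entry per day: cumulative infections per site.
    """
    n = len(populations)
    infected = [0.0] * n
    infected[origin] = seed
    history = []
    for _ in range(days):
        history.append(infected[:])
        updated = []
        for i in range(n):
            imported = sum(connectivity[i][j] * infected[j] for j in range(n) if j != i)
            susceptible = 1.0 - infected[i] / populations[i]
            gained = (growth * infected[i] + imported) * susceptible
            updated.append(min(populations[i], infected[i] + gained))
        infected = updated
    return history


def incidence_at_global_threshold(history, populations, monitored, threshold=0.01):
    """Return (day, global incidence, sample incidence) on the first day that
    global cumulative incidence reaches `threshold`, or None if it never does."""
    total_pop = sum(populations)
    monitored_pop = sum(populations[i] for i in monitored)
    for day, infected in enumerate(history):
        global_ci = sum(infected) / total_pop
        if global_ci >= threshold:
            sample_ci = sum(infected[i] for i in monitored) / monitored_pop
            return day, global_ci, sample_ci
    return None


# Hypothetical sites: 0 = origin city, 1 = well-connected hub, 2 = isolated site.
populations = [10_000_000, 8_000_000, 8_000_000]
connectivity = [
    [0.0, 1e-3, 1e-6],
    [1e-3, 0.0, 1e-6],
    [1e-6, 1e-6, 0.0],
]
history = simulate(populations, connectivity, origin=0)

for name, site in [("hub", 1), ("isolated", 2)]:
    result = incidence_at_global_threshold(history, populations, [site])
    if result is not None:
        day, global_ci, sample_ci = result
        print(f"{name}: day {day}, global {global_ci:.4%}, sample {sample_ci:.4%}")
```

With these made-up parameters, the hub is far ahead of the isolated site but still behind global prevalence, because the origin city dominates early global incidence. Different connectivity values or a different set of monitored sites would shift that balance, which is exactly the kind of question a proper model would need to answer.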
My guess is that with a single monitored city, even the optimal one (which one is that even?), your sample prevalence will significantly lag global prevalence in most pandemics, but by carefully choosing a few cities to monitor around the world you can probably get to where it leads global prevalence. But I would love to see some research and modeling on this: qualitative intuitions don’t take us very far. Specifically:
How does prevalence at a highly-connected site compare to global prevalence during the beginning of a pandemic?
What if you instead are monitoring a collection of highly-connected sites?
What does the diminishing returns curve look like for bringing additional sites up? Does it go negative at some point, where you are sampling so many excellent sites that the marginal site is mostly dilutative?
If you look at the initial spread of SARS-CoV-2, how much of the variance in when places were infected is explained by how connected they are?
What about with data from the spread of influenza and SARS-CoV-2 variants?
Are there other major factors aside from connectedness that lead to earlier infection? Can we model how valuable different sites are to sample, in a way that can be combined with how operationally difficult it is to sample in various places?
If you know of good work on these sorts of modeling questions or are interested in collaborating on them, please get in touch! My work email is jeff at securebio.org.
Jeff, your notes on NAO are fascinating to read! I have nothing to add other than that I hope you keep posting them
Thanks!
Great post—I really enjoyed reading this.
I would have thought the standard way to resolve some of the questions above would be to use a large agent-based model, simulating disease transmission among millions of agents and then observing how successful some testing scheme is within the model (you might be able to backtest the model against well-documented outbreaks).
I’m not sure how much you’d trust these models over your intuitions, but I’d guess they’d have quite a lot of mileage.
I’ve only skimmed these papers, but these seem promising and illustrative of the direction to me:
Scaling of agent-based models to evaluate transmission risks of infectious diseases
3D Agent-Based Model of Pedestrian Movements for Simulating COVID-19 Transmission in University Students
BioWar: scalable agent-based model of bioattacks
The best stuff looking at global-scale analysis of epidemics is probably by GLEAM. I doubt full agent-based modelling at small scales gives you much beyond massively complicating the model.
1% cumulative incidence is quite high, so I think by that point the pandemic is probably far enough along that you’re fine. E.g. we’ve estimated London hit this point for COVID around 22 Mar 2020, when it was pretty much everywhere.
I’m not sure what you mean by this?
(Yes, 1% cumulative incidence is high—I wish the NAO were funded to the point that we could be talking about whether 0.01% or 0.001% was achievable.)
Sorry, I answered the wrong question, and am slightly confused about what this post is trying to get at. I think your question is: will NYC hit 1% cumulative incidence after global 1% cumulative incidence?
I think this is almost never going to be the case for fairly indiscriminately-spreading respiratory pathogens, such as flu or COVID.
The answer is yes only if NYC’s cumulative incidence is lower than the global mean across regions (weighted by population). Due to connectedness, I expect NYC to always be hit pretty early, as you point out, definitely before most rural communities. I think the key point here is that NYC doesn’t need to be ahead of the epicentre of the disease, only the global mean.
One way of looking at this is how early on does NYC get hit compared to other cities/regions. This analysis (pdf) orders cities by connectedness to Wuhan to answer this question for COVID. It looks like they’ve released an online tool that lets you specify different origin locations and epidemiological parameters. So you could rank how early NYC gets hit for a range of different scenarios.
This would surprise me. It’s hard to imagine a scenario where the arrival time at different major travel hubs is very desynchronized as these locations are highly connected to each other. So you’d probably then end up looking at a long tail of locations which are poorly connected to the main travel hubs.
That’s one of the main questions, yes.
The core idea is that our efficacy simulations are in terms of cumulative incidence in a monitored population, but what people generally care about is cumulative incidence in the global (or a specific country’s) population.
Thanks! The tool is neat, and it’s close to the approach I’d want to see.
I don’t see how you can say both that it will “almost never” be the case that NYC will “hit 1% cumulative incidence after global 1% cumulative incidence” but also that it would surprise you if you can get to where your monitored cities lead global prevalence?
Sorry, this is poorly phrased by me. I meant that it would surprise me if there’s much benefit from adding a few additional cities.
Possibly! That would certainly be a convenient finding (from my perspective) if it did end up working out that way.
Thank you, this is fascinating. Is there an option to monitor wastewater just from airports (as well as generally for a whole city)? Then anything brought in on international flights might be less diluted and you might be able to detect it sooner, idk?
I realise that the world is a little bit different than in 1918, but given that the Spanish Flu was spread by troop movements, I wonder what the various militaries are doing and if they see themselves as having a role in pandemic prevention?
The NAO ran a pilot where we worked with the CDC and Ginkgo to collect and sequence pooled airplane toilet waste. We haven’t sequenced these samples as deeply as we would like to yet, but initial results look very promising.
Militaries are generally interested in this kind of thing, but primarily as biodefense: protecting the population and service members.
Thanks for the post! This may not be helpful, but one thing I would be curious to see would be how the dispersion coefficient k (discussed here; I’m sure there’s a better reference source) affects the importance of having many sites. With COVID, a lot of transmission came from superspreader events, which intuitively would increase the variance of how quickly it spread in different sites. On the other hand, the flu has a low proportion of superspreader events, so testing in a well-connected site might explain more of the variance?
I haven’t done or seen any modeling on this, but intuitively I would expect the variance due to superspreading to have most of its impact in the very early days, when single superspreading events can meaningfully accelerate the progress of the pandemic in a specific location, and to be minimal by the time you get to ~1% cumulative incidence?